Title: Disaster Relief Project: Part II
Author: Manpreet Dhindsa
Date: 08/15/21
DS 6030 | Summer 2021 | University of Virginia
This project is meant to help locate displaced people living in blue-tarp shelters after an earthquake in Haiti. We use pixel data to locate people under blue tarps efficiently and accurately, since these are likely individuals who need help.
First, I use the training data to train each model and determine its tuning parameters and threshold. Then, I apply all tuned models to the holdout data and calculate performance metrics using the tuning parameters and thresholds determined from the training data.
#-- Load Required Packages
#install.packages("tidymodels")
library(tidyverse)
#> ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
#> ✓ ggplot2 3.3.4 ✓ purrr 0.3.4
#> ✓ tibble 3.1.2 ✓ dplyr 1.0.6
#> ✓ tidyr 1.1.3 ✓ stringr 1.4.0
#> ✓ readr 1.4.0 ✓ forcats 0.5.1
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
library(broom)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
#> ── Attaching packages ────────────────────────────────────── tidymodels 0.1.3 ──
#> ✓ dials 0.0.9 ✓ rsample 0.1.0
#> ✓ infer 0.5.4 ✓ tune 0.1.6
#> ✓ modeldata 0.1.1 ✓ workflows 0.2.3
#> ✓ parsnip 0.1.7 ✓ workflowsets 0.1.0
#> ✓ recipes 0.1.16 ✓ yardstick 0.0.8
#> ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
#> x scales::discard() masks purrr::discard()
#> x dplyr::filter() masks stats::filter()
#> x recipes::fixed() masks stringr::fixed()
#> x dplyr::lag() masks stats::lag()
#> x yardstick::spec() masks readr::spec()
#> x recipes::step() masks stats::step()
#> • Use tidymodels_prefer() to resolve common conflicts.
library(randomForest)
#> randomForest 4.6-14
#> Type rfNews() to see new features/changes/bug fixes.
#>
#> Attaching package: 'randomForest'
#> The following object is masked from 'package:dplyr':
#>
#> combine
#> The following object is masked from 'package:ggplot2':
#>
#> margin
For the first part of this project, it is important to load the data properly and ensure that Class is a factor variable with two levels, Blue_Tarp and Other, since the goal of this project is to locate as many blue tarps as possible. It is not necessary to differentiate between soil, rooftops, vegetation, or various non-tarp materials, so I have grouped and labeled all of them as Other.
#-- Load Data
Haiti = read.csv("HaitiPixels.csv")
Haiti = Haiti[, c('Blue', 'Green', 'Red', 'Class')]
#make Class a factor variable
Haiti$Class= factor(Haiti$Class)
levels(Haiti$Class)
#> [1] "Blue Tarp" "Rooftop" "Soil" "Various Non-Tarp"
#> [5] "Vegetation"
levels(Haiti$Class) = list("Blue_Tarp" = c("Blue Tarp"),
"Other" = c("Rooftop","Soil","Various Non-Tarp","Vegetation"))
Haiti$Class = relevel(Haiti$Class, ref = "Blue_Tarp")
contrasts(Haiti$Class)
#> Other
#> Blue_Tarp 0
#> Other 1
# check for missing data (is.na() tests element-wise; is.null() would not)
any(is.na(Haiti))
#> [1] FALSE
#EDA
# Box Plots
par(mfrow=c(1,3))
boxplot(Haiti$Red~Haiti$Class, main = "Boxplot of Red against Classes")
boxplot(Haiti$Green~Haiti$Class, main = "Boxplot of Green against Classes")
boxplot(Haiti$Blue~Haiti$Class, main = "Boxplot of Blue against Classes")
These boxplots show that Red has a much wider distribution among the other objects than among blue tarps, and Green behaves similarly. For both Red and Green, the distributions for Blue Tarp and Other overlap. For Blue, however, the distributions are clearly distinct between the two classes. This makes sense: blue tarps should have higher Blue values than the other materials.
#summary stats of variables
plot(Haiti$Class)
summary(Haiti$Red)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 48 80 163 163 255 255
summary(Haiti$Green)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 48.0 78.0 148.0 153.7 226.0 255.0
summary(Haiti$Blue)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 44.0 63.0 123.0 125.1 181.0 255.0
From the plot, we can see that this data set is highly imbalanced, with over 60,000 observations in the Other group and only about 2,000 blue tarps (a prevalence of roughly 3.2%). Additionally, the summary statistics show that Green and Red are very similar for the most part, while Blue is generally lower than both.
When setting up the data, there is no need to split into training and test sets, because this entire initial data set serves as the training data for determining tuning parameters and thresholds. The holdout data will then be used to test the models with the established tuning parameters and thresholds.
I have set up a shared cross-validation control object and a function for the confusion matrix and performance metrics, since these are used for each of the seven models I train. Defining them once is more efficient.
# Control object for 10-fold cross-validation (shared by all models).
# Note: metric='ROC' in train() requires summaryFunction=twoClassSummary here;
# without it, caret falls back to selecting models by Accuracy (as seen below).
library(tidyverse)
library(glmnet)
library(caret)
set.seed(1)
trControl = caret::trainControl(method="cv",
number=10,
savePredictions=TRUE,
classProbs=TRUE,
allowParallel=TRUE)
# Function for Confusion Matrix and Performance Metrics
calculateConfusionMatrix = function(model, data, threshold) {
prediction = predict(model, data, type='prob')$Blue_Tarp
predClass = as.factor(ifelse(prediction > threshold, 'Blue_Tarp', 'Other'))
return (confusionMatrix(data=predClass, reference=data$Class))
}
With logistic regression, before using cross-validation to determine tuning parameters, I used forward, backward, and stepwise selection to check whether all three predictors are needed and whether interaction terms would improve the model.
I created three models: all three predictors with no interactions, all three predictors with all interactions, and an intercept-only model. Comparing the intercept-only model against the no-interaction model established that all three predictors are needed. Comparing the no-interaction model against the interaction model showed that no interaction terms are needed.
#creating model without interactions
reg = glm(Haiti$Class ~ Haiti$Red + Haiti$Green + Haiti$Blue, data = Haiti, family = "binomial")
summary(reg)
#>
#> Call:
#> glm(formula = Haiti$Class ~ Haiti$Red + Haiti$Green + Haiti$Blue,
#> family = "binomial", data = Haiti)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -3.7911 0.0000 0.0015 0.0205 3.4266
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.20984 0.18455 -1.137 0.256
#> Haiti$Red 0.26031 0.01262 20.632 <2e-16 ***
#> Haiti$Green 0.21831 0.01330 16.416 <2e-16 ***
#> Haiti$Blue -0.47241 0.01548 -30.511 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 17901.6 on 63240 degrees of freedom
#> Residual deviance: 1769.5 on 63237 degrees of freedom
#> AIC: 1777.5
#>
#> Number of Fisher Scoring iterations: 12
#creating model to test interactions
reg2 = glm(Haiti$Class ~ Haiti$Red*Haiti$Green*Haiti$Blue, data = Haiti, family = "binomial")
summary(reg2)
#>
#> Call:
#> glm(formula = Haiti$Class ~ Haiti$Red * Haiti$Green * Haiti$Blue,
#> family = "binomial", data = Haiti)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -3.9244 0.0000 0.0031 0.0194 3.2526
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 4.598e+01 4.167e+00 11.034 < 2e-16 ***
#> Haiti$Red -3.329e-01 5.287e-02 -6.296 3.06e-10 ***
#> Haiti$Green -2.284e-01 6.469e-02 -3.530 0.000415 ***
#> Haiti$Blue -3.425e-01 4.784e-02 -7.158 8.17e-13 ***
#> Haiti$Red:Haiti$Green 3.826e-03 5.049e-04 7.577 3.54e-14 ***
#> Haiti$Red:Haiti$Blue 1.681e-03 4.135e-04 4.065 4.80e-05 ***
#> Haiti$Green:Haiti$Blue -2.920e-05 2.666e-04 -0.110 0.912786
#> Haiti$Red:Haiti$Green:Haiti$Blue -1.021e-05 1.085e-06 -9.416 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 17902 on 63240 degrees of freedom
#> Residual deviance: 1195 on 63233 degrees of freedom
#> AIC: 1211
#>
#> Number of Fisher Scoring iterations: 14
#creating a model with no predictors
regnull = glm(Haiti$Class ~ 1, data = Haiti, family = "binomial")
#forward, backward, and stepwise between no predictors and model without interactions
#step(reg, scope=list(lower=regnull, upper=reg), direction="backward")
#step(reg, scope=list(lower=regnull, upper=reg), direction="forward")
#step(reg, scope=list(lower=regnull, upper=reg), direction="both")
#use red, green, blue
#forward, backward, and stepwise between without interaction and with interaction
#step(reg, scope=list(lower=reg, upper=reg2), direction="backward")
#step(reg, scope=list(lower=reg, upper=reg2), direction="forward")
#step(reg, scope=list(lower=reg, upper=reg2), direction="both")
I have commented out these lines for the report, but the results indicate that I should use the model Class ~ Red + Green + Blue, as supported by all of the backward-selection runs. The comparisons against the intercept-only model confirm that all three predictors are needed. Regarding interactions, forward and stepwise selection gave differing results, but backward selection supported the model with all three predictors and no interactions.
# install.packages("caret")
#https://topepo.github.io/caret/model-training-and-tuning.html
set.seed(1)
LogRegModel = train(Class ~ Blue + Red + Green, data=Haiti,
method='glm',
family='binomial',
metric='ROC',
trControl=trControl)
LogRegModel
#> Generalized Linear Model
#>
#> 63241 samples
#> 3 predictor
#> 2 classes: 'Blue_Tarp', 'Other'
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 56917, 56918, 56917, 56916, 56917, 56917, ...
#> Resampling results:
#>
#> Accuracy Kappa
#> 0.9952879 0.9206597
According to the caret documentation (https://topepo.github.io/caret/available-models.html), there are no tuning parameters for glm. The cross-validated accuracy is very high, which suggests this may be a useful model for this project, though with such strong class imbalance accuracy alone can overstate performance.
LogRegResult = thresholder(LogRegModel, seq(0,1, by=0.05), final = TRUE, statistics = "all")
To determine the threshold for each model, I use two plots: the positive predictive value with the negative predictive value overlaid as a line, and the precision (caret reports these separately, although precision and positive predictive value coincide for the positive class). Looking at both plots, I choose the threshold that yields a high value in each.
plot(LogRegResult$prob_threshold, LogRegResult$`Pos Pred Value`)
lines(LogRegResult$prob_threshold, LogRegResult$`Neg Pred Value`)
plot(LogRegResult$prob_threshold, LogRegResult$Precision)
# using Precision and Positive Prediction value, check for highest value for the threshold
threshold = 0.3
For the Confusion Matrix and Performance Metrics threshold, I am most concerned about false negatives (classifying an actual blue tarp as not a blue tarp), because I want to capture all blue tarps and miss as few as possible.
In this case, the threshold that resulted in the highest precision and positive prediction value is 0.3.
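As a sanity check, the thresholder output can also be inspected directly around the chosen cutoff. This is a small sketch using the LogRegResult data frame created above; the column names are the ones caret's thresholder() produces.

```r
# Inspect cross-validated metrics near the chosen threshold of 0.3
near = subset(LogRegResult, prob_threshold %in% c(0.25, 0.30, 0.35))
near[, c("prob_threshold", "Sensitivity", "Specificity",
         "Pos Pred Value", "Neg Pred Value")]
```

If 0.3 is a good choice, the positive and negative predictive values in these rows should both be high relative to neighboring thresholds.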
cm = calculateConfusionMatrix(LogRegModel, Haiti, threshold)
cm$overall['Accuracy']
#> Accuracy
#> 0.9956516
cm$byClass
#> Sensitivity Specificity Pos Pred Value
#> 0.91048467 0.99846453 0.95142119
#> Neg Pred Value Precision Recall
#> 0.99704760 0.95142119 0.91048467
#> F1 Prevalence Detection Rate
#> 0.93050291 0.03197293 0.02911086
#> Detection Prevalence Balanced Accuracy
#> 0.03059724 0.95447460
These metrics are mostly strong. The sensitivity is slightly lower at 0.91, but the other values such as accuracy, precision, and specificity are quite high.
library(ROCR)
pred1 = predict(LogRegModel,Haiti, type = "prob")
pred1 = prediction(pred1$Other, Haiti$Class)
roc1 = performance(pred1,"tpr","fpr")
plot(roc1, lwd = 2)
abline(a = 0, b = 1)
auc1 = performance(pred1, measure = "auc")
auc1@y.values[[1]]
#> [1] 0.9985069
This is a very high AUC, so this model separates blue tarps from other material well.
library(MASS)
#>
#> Attaching package: 'MASS'
#> The following object is masked from 'package:dplyr':
#>
#> select
lda=lda(Haiti$Class~ Haiti$Red + Haiti$Green + Haiti$Blue, data=Haiti)
lda
#> Call:
#> lda(Haiti$Class ~ Haiti$Red + Haiti$Green + Haiti$Blue, data = Haiti)
#>
#> Prior probabilities of groups:
#> Blue_Tarp Other
#> 0.03197293 0.96802707
#>
#> Group means:
#> Haiti$Red Haiti$Green Haiti$Blue
#> Blue_Tarp 169.6627 186.4149 205.0371
#> Other 162.7604 152.5808 122.4993
#>
#> Coefficients of linear discriminants:
#> LD1
#> Haiti$Red 0.02896984
#> Haiti$Green 0.02328544
#> Haiti$Blue -0.06923974
plot(lda)
set.seed(1)
LDAModel = train(Class ~ Blue + Red + Green, data=Haiti,
method='lda',
family='binomial',
metric='ROC',
trControl=trControl)
LDAModel
#> Linear Discriminant Analysis
#>
#> 63241 samples
#> 3 predictor
#> 2 classes: 'Blue_Tarp', 'Other'
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 56917, 56918, 56917, 56916, 56917, 56917, ...
#> Resampling results:
#>
#> Accuracy Kappa
#> 0.9839662 0.7533628
According to https://topepo.github.io/caret/available-models.html, there are no tuning parameters for LDA, and the accuracy is fairly strong.
LDAResult = thresholder(LDAModel, seq(0,1, by=0.05), final = TRUE, statistics = "all")
plot(LDAResult$prob_threshold, LDAResult$`Pos Pred Value`)
lines(LDAResult$prob_threshold, LDAResult$`Neg Pred Value`)
plot(LDAResult$prob_threshold, LDAResult$Precision)
# using Precision and Positive Prediction value, check for highest value for the threshold
threshold2 = 0.9
As with logistic regression, the priority is minimizing false negatives (missed blue tarps).
Here, the threshold that results in the highest precision and positive predictive value is 0.9.
cm2 = calculateConfusionMatrix(LDAModel, Haiti, threshold2)
cm2$overall['Accuracy']
#> Accuracy
#> 0.9846302
cm2$byClass
#> Sensitivity Specificity Pos Pred Value
#> 0.74183976 0.99264934 0.76923077
#> Neg Pred Value Precision Recall
#> 0.99148325 0.76923077 0.74183976
#> F1 Prevalence Detection Rate
#> 0.75528701 0.03197293 0.02371879
#> Detection Prevalence Balanced Accuracy
#> 0.03083443 0.86724455
The metrics for LDA do not perform as well, with sensitivity and precision both around 0.75. This is most likely not strong enough for the scenario of locating blue tarps.
library(ROCR)
pred2 = predict(LDAModel,Haiti, type = "prob")
pred2 = prediction(pred2$Other, Haiti$Class)
roc2 = performance(pred2,"tpr","fpr")
plot(roc2, lwd = 2)
abline(a = 0, b = 1)
auc2 = performance(pred2, measure = "auc")
auc2@y.values[[1]]
#> [1] 0.9888768
This AUC value is fairly high but does not perform as well as logistic regression.
library(MASS)
qda=qda(Haiti$Class~ Haiti$Red + Haiti$Green + Haiti$Blue, data=Haiti)
qda
#> Call:
#> qda(Haiti$Class ~ Haiti$Red + Haiti$Green + Haiti$Blue, data = Haiti)
#>
#> Prior probabilities of groups:
#> Blue_Tarp Other
#> 0.03197293 0.96802707
#>
#> Group means:
#> Haiti$Red Haiti$Green Haiti$Blue
#> Blue_Tarp 169.6627 186.4149 205.0371
#> Other 162.7604 152.5808 122.4993
set.seed(1)
QDAModel = train(Class ~ Blue + Red + Green, data=Haiti,
method='qda',
family='binomial',
metric='ROC',
trControl=trControl)
QDAModel
#> Quadratic Discriminant Analysis
#>
#> 63241 samples
#> 3 predictor
#> 2 classes: 'Blue_Tarp', 'Other'
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 56917, 56918, 56917, 56916, 56917, 56917, ...
#> Resampling results:
#>
#> Accuracy Kappa
#> 0.994608 0.9058635
According to https://topepo.github.io/caret/available-models.html, there are no tuning parameters for QDA, and the accuracy is high.
QDAResult = thresholder(QDAModel, seq(0,1, by=0.05), final = TRUE, statistics = "all")
plot(QDAResult$prob_threshold, QDAResult$`Pos Pred Value`)
lines(QDAResult$prob_threshold, QDAResult$`Neg Pred Value`)
plot(QDAResult$prob_threshold, QDAResult$Precision)
# using Precision and Positive Prediction value, check for highest value for the threshold
threshold3 = 0.4
As before, the priority is minimizing false negatives (missed blue tarps).
Here, the threshold that results in the highest precision and positive predictive value is 0.4.
cm3 = calculateConfusionMatrix(QDAModel, Haiti, threshold3)
cm3$overall['Accuracy']
#> Accuracy
#> 0.9947344
cm3$byClass
#> Sensitivity Specificity Pos Pred Value
#> 0.84767557 0.99959163 0.98562392
#> Neg Pred Value Precision Recall
#> 0.99499203 0.98562392 0.84767557
#> F1 Prevalence Detection Rate
#> 0.91145972 0.03197293 0.02710267
#> Detection Prevalence Balanced Accuracy
#> 0.02749798 0.92363360
The values in this performance table are much better than those for LDA, though the sensitivity (0.85) is still relatively low.
library(ROCR)
pred3 = predict(QDAModel,Haiti, type = "prob")
pred3 = prediction(pred3$Other, Haiti$Class)
roc3 = performance(pred3,"tpr","fpr")
plot(roc3, lwd = 2)
abline(a = 0, b = 1)
auc3 = performance(pred3, measure = "auc")
auc3@y.values[[1]]
#> [1] 0.9982175
This AUC value is fairly high, so it seems like this model would perform well in terms of classifying.
library(MASS)
library(class)
set.seed(1)
# note: predictors must be separate columns; summing Red + Green + Blue
# would collapse them into a single variable
haitiknn = knn(train = Haiti[, c("Red", "Green", "Blue")],
               test = Haiti[, c("Red", "Green", "Blue")],
               cl = Haiti$Class, prob = TRUE)
set.seed(1)
tuneGrid = expand.grid(k = seq(1, 30, by=2))
KNNModel = train(Class ~ Blue + Red + Green, data=Haiti,
method='knn',
metric='ROC',
tuneGrid=tuneGrid,
trControl=trControl)
KNNModel
#> k-Nearest Neighbors
#>
#> 63241 samples
#> 3 predictor
#> 2 classes: 'Blue_Tarp', 'Other'
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 56917, 56918, 56917, 56916, 56917, 56917, ...
#> Resampling results across tuning parameters:
#>
#> k Accuracy Kappa
#> 1 0.9967110 0.9463976
#> 3 0.9972012 0.9546289
#> 5 0.9972644 0.9558097
#> 7 0.9971537 0.9542435
#> 9 0.9971063 0.9533767
#> 11 0.9970114 0.9518480
#> 13 0.9970589 0.9526042
#> 15 0.9968849 0.9497777
#> 17 0.9968533 0.9491565
#> 19 0.9969166 0.9501443
#> 21 0.9968533 0.9491692
#> 23 0.9968375 0.9489660
#> 25 0.9966952 0.9465315
#> 27 0.9966478 0.9457363
#> 29 0.9966952 0.9465105
#>
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was k = 5.
ggplot(KNNModel$results, aes(x=k, y=Accuracy)) +
geom_point() + geom_line()
Based on cross-validation, we should use k=5 as the tuning parameter. As shown by the graph, this k value results in the highest accuracy.
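The selected parameter can also be pulled out programmatically rather than read off the printout; bestTune is caret's standard slot for the winning tuning-parameter row.

```r
# One-row data frame holding the chosen tuning parameter(s)
KNNModel$bestTune
```

This is convenient when the chosen value needs to be reused later, e.g. when refitting on the holdout data.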
KNNresult = thresholder(KNNModel, seq(0,1, by=0.05), final = TRUE, statistics = "all")
plot(KNNresult$prob_threshold, KNNresult$`Pos Pred Value`)
lines(KNNresult$prob_threshold, KNNresult$`Neg Pred Value`)
plot(KNNresult$prob_threshold, KNNresult$Precision)
# using Precision and Positive Prediction value, check for highest value for the threshold
threshold4 = 0.9
Based on the highest precision and positive prediction value, I will use a threshold of 0.9 since that provides the best performance.
cm4 = calculateConfusionMatrix(KNNModel, Haiti, threshold4)
cm4$overall['Accuracy']
#> Accuracy
#> 0.9961892
cm4$byClass
#> Sensitivity Specificity Pos Pred Value
#> 0.88081108 1.00000000 1.00000000
#> Neg Pred Value Precision Recall
#> 0.99607875 1.00000000 0.88081108
#> F1 Prevalence Detection Rate
#> 0.93662898 0.03197293 0.02816211
#> Detection Prevalence Balanced Accuracy
#> 0.02816211 0.94040554
Overall, these results perform very well, especially in terms of accuracy, specificity, and precision. Although the sensitivity is slightly lower at 0.88, the results still look promising for this scenario.
library(ROCR)
pred4 = predict(KNNModel,Haiti, type = "prob")
pred4 = prediction(pred4$Other, Haiti$Class)
roc4 = performance(pred4,"tpr","fpr")
plot(roc4, lwd = 2)
abline(a = 0, b = 1)
auc4 = performance(pred4, measure = "auc")
auc4@y.values[[1]]
#> [1] 0.9998582
This ROC curve and AUC value are also very high, so this model appears to perform very well at classifying.
library(glmnet)
set.seed(1)
xHaiti = model.matrix(Class~Red+Green+Blue,Haiti)[,-1] #x matrix
yHaiti = Haiti$Class #y vector (Class is column 4, not column 1)
cvridge = cv.glmnet(xHaiti, yHaiti, family="binomial", alpha=0)
cvlasso = cv.glmnet(xHaiti, yHaiti, family="binomial", alpha=1)
plot(cvridge)
plot(cvlasso)
These graphs illustrate which value of the tuning parameter lambda results in the lowest cross-validated error (the vertical dotted lines mark lambda.min and lambda.1se). For both ridge and lasso, the error rises sharply once log(lambda) grows large and the coefficients are shrunk too aggressively.
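The minimizing values can also be read off the fits directly; lambda.min and lambda.1se are standard slots on cv.glmnet objects.

```r
# lambda minimizing cross-validated error, and the more conservative
# 1-standard-error choice, for each penalty
cvridge$lambda.min; cvridge$lambda.1se
cvlasso$lambda.min; cvlasso$lambda.1se
```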
set.seed(1)
lambdas = 10^seq(1, -3, by=-0.1)
tuneGrid = expand.grid(alpha=0,
lambda=tail(lambdas, -3))
PenLogRegModel = train(Class ~ Blue + Red + Green, data=Haiti,
method='glmnet',
tuneGrid = tuneGrid,
trControl=trControl)
PenLogRegModel
#> glmnet
#>
#> 63241 samples
#> 3 predictor
#> 2 classes: 'Blue_Tarp', 'Other'
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 56917, 56918, 56917, 56916, 56917, 56917, ...
#> Resampling results across tuning parameters:
#>
#> lambda Accuracy Kappa
#> 0.001000000 0.9778466 0.46094788
#> 0.001258925 0.9778466 0.46094788
#> 0.001584893 0.9778466 0.46094788
#> 0.001995262 0.9778466 0.46094788
#> 0.002511886 0.9778466 0.46094788
#> 0.003162278 0.9778466 0.46094788
#> 0.003981072 0.9778466 0.46094788
#> 0.005011872 0.9746841 0.33638569
#> 0.006309573 0.9708733 0.15881995
#> 0.007943282 0.9687228 0.04117784
#> 0.010000000 0.9680271 0.00000000
#> 0.012589254 0.9680271 0.00000000
#> 0.015848932 0.9680271 0.00000000
#> 0.019952623 0.9680271 0.00000000
#> 0.025118864 0.9680271 0.00000000
#> 0.031622777 0.9680271 0.00000000
#> 0.039810717 0.9680271 0.00000000
#> 0.050118723 0.9680271 0.00000000
#> 0.063095734 0.9680271 0.00000000
#> 0.079432823 0.9680271 0.00000000
#> 0.100000000 0.9680271 0.00000000
#> 0.125892541 0.9680271 0.00000000
#> 0.158489319 0.9680271 0.00000000
#> 0.199526231 0.9680271 0.00000000
#> 0.251188643 0.9680271 0.00000000
#> 0.316227766 0.9680271 0.00000000
#> 0.398107171 0.9680271 0.00000000
#> 0.501187234 0.9680271 0.00000000
#> 0.630957344 0.9680271 0.00000000
#> 0.794328235 0.9680271 0.00000000
#> 1.000000000 0.9680271 0.00000000
#> 1.258925412 0.9680271 0.00000000
#> 1.584893192 0.9680271 0.00000000
#> 1.995262315 0.9680271 0.00000000
#> 2.511886432 0.9680271 0.00000000
#> 3.162277660 0.9680271 0.00000000
#> 3.981071706 0.9680271 0.00000000
#> 5.011872336 0.9680271 0.00000000
#>
#> Tuning parameter 'alpha' was held constant at a value of 0
#> Accuracy was used to select the optimal model using the largest value.
#> The final values used for the model were alpha = 0 and lambda = 0.003981072.
ggplot(PenLogRegModel$results, aes(x=lambda, y=Accuracy)) +
geom_point() + geom_line()
The tuning parameters selected by cross-validation are alpha = 0 and lambda = 0.003981072, which give the highest accuracy.
PenLogRegResult = thresholder(PenLogRegModel, seq(0,1, by=0.05), final = TRUE, statistics = "all")
plot(PenLogRegResult$prob_threshold, PenLogRegResult$`Pos Pred Value`)
lines(PenLogRegResult$prob_threshold, PenLogRegResult$`Neg Pred Value`)
plot(PenLogRegResult$prob_threshold, PenLogRegResult$Precision)
# using Precision and Positive Prediction value, check for highest value for the threshold
threshold5 = 0.3
The threshold that provides the highest precision and positive prediction value is 0.3.
cm5 = calculateConfusionMatrix(PenLogRegModel, Haiti, threshold5)
cm5$overall['Accuracy']
#> Accuracy
#> 0.984899
cm5$byClass
#> Sensitivity Specificity Pos Pred Value
#> 0.52769535 1.00000000 1.00000000
#> Neg Pred Value Precision Recall
#> 0.98463988 1.00000000 0.52769535
#> F1 Prevalence Detection Rate
#> 0.69083846 0.03197293 0.01687197
#> Detection Prevalence Balanced Accuracy
#> 0.01687197 0.76384768
Although the accuracy, specificity, and precision values are very high, the sensitivity of 0.53 may be too low for this scenario.
library(ROCR)
pred5 = predict(PenLogRegModel,Haiti, type = "prob")
pred5 = prediction(pred5$Other, Haiti$Class)
roc5 = performance(pred5,"tpr","fpr")
plot(roc5, lwd = 2)
abline(a = 0, b = 1)
auc5 = performance(pred5, measure = "auc")
auc5@y.values[[1]]
#> [1] 0.9801954
The AUC value is slightly lower than what has been seen with the other algorithms, but it is still overall high.
library(randomForest)
set.seed(1)
sampleH=sample(1:nrow(Haiti),nrow(Haiti)*0.5)
trainH=Haiti[sampleH,1:3]
testH=Haiti[-sampleH,1:3]
trainY=Haiti[sampleH,4]
testY=Haiti[-sampleH,4]
Haitirf=randomForest(trainH,trainY,testH,testY, importance = TRUE)
For cross-validation with caret, I use the default of 500 trees. The number of trees is not a tuned parameter; 500 is large enough that the error rate has stabilized, so only mtry needs to be tuned.
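If a different forest size were desired, the ntree argument can be passed straight through train() to randomForest. This is a sketch under that assumption (not run here); the value 250 is purely illustrative, and mtry remains the parameter tuned by caret.

```r
set.seed(1)
# ntree is forwarded by caret to randomForest(); mtry is still tuned
RFModelSmall = train(Class ~ Blue + Red + Green, data = Haiti,
                     method = 'rf',
                     ntree = 250,   # hypothetical smaller forest
                     trControl = trControl)
```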
set.seed(1)
RFModel = train(Class ~ Blue + Red + Green, data=Haiti,
method='rf',
family='binomial',
metric='ROC',
trControl=trControl,
)
#> note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .
#https://topepo.github.io/caret/available-models.html
RFModel
#> Random Forest
#>
#> 63241 samples
#> 3 predictor
#> 2 classes: 'Blue_Tarp', 'Other'
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 56917, 56918, 56917, 56916, 56917, 56917, ...
#> Resampling results across tuning parameters:
#>
#> mtry Accuracy Kappa
#> 2 0.9969324 0.9500109
#> 3 0.9967901 0.9476633
#>
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 2.
ggplot(RFModel$results, aes(x=mtry, y=Accuracy)) +
geom_point() + geom_line()
The tuning parameter from cross-validation is mtry = 2 since this results in the highest accuracy.
RFResult= thresholder(RFModel, seq(0,1, by=0.05), final = TRUE, statistics = "all")
plot(RFResult$prob_threshold, RFResult$`Pos Pred Value`)
lines(RFResult$prob_threshold, RFResult$`Neg Pred Value`)
plot(RFResult$prob_threshold, RFResult$Precision)
# using Precision and Positive Prediction value, check for highest value for the threshold
threshold6 = 0.9
The threshold that results in the highest precision and positive prediction value is 0.9.
cm6 = calculateConfusionMatrix(RFModel, Haiti, threshold6)
cm6$overall['Accuracy']
#> Accuracy
#> 0.9971696
cm6$byClass
#> Sensitivity Specificity Pos Pred Value
#> 0.91147379 1.00000000 1.00000000
#> Neg Pred Value Precision Recall
#> 0.99708460 1.00000000 0.91147379
#> F1 Prevalence Detection Rate
#> 0.95368693 0.03197293 0.02914249
#> Detection Prevalence Balanced Accuracy
#> 0.02914249 0.95573689
The performance here looks highly promising, with very high accuracy, specificity, and precision. The sensitivity is slightly lower but still above 0.91, which is good performance.
library(ROCR)
pred6 = predict(RFModel,Haiti, type = "prob")
pred6 = prediction(pred6$Other, Haiti$Class)
roc6 = performance(pred6,"tpr","fpr")
plot(roc6, lwd = 2)
abline(a = 0, b = 1)
auc6 = performance(pred6, measure = "auc")
auc6@y.values[[1]]
#> [1] 0.9944983
The AUC value is also very strong for this algorithm, which further suggests this may be the best model for this scenario.
I am using cross-validation with three different kernels from the kernlab library (linear, polynomial, and radial) to find the tuning parameters for each.
#install.packages("kernlab")
library(kernlab)
set.seed(1)
SVMModelLinear = train(Class ~ Blue + Red + Green, data=Haiti,
method='svmLinear',
family='binomial',
metric='ROC',
trControl=trControl,
)
SVMModelLinear
#> Support Vector Machines with Linear Kernel
#>
#> 63241 samples
#> 3 predictor
#> 2 classes: 'Blue_Tarp', 'Other'
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 56917, 56918, 56917, 56916, 56917, 56917, ...
#> Resampling results:
#>
#> Accuracy Kappa
#> 0.9954144 0.9228775
#>
#> Tuning parameter 'C' was held constant at a value of 1
With a linear SVM, the tuning parameter that should be used is Cost = 1 since that resulted in the highest accuracy.
library(kernlab)
set.seed(1)
SVMModelPoly = train(Class ~ Blue + Red + Green, data=Haiti,
method='svmPoly',
family='binomial',
metric='ROC',
trControl=trControl,
)
SVMModelPoly
#> Support Vector Machines with Polynomial Kernel
#>
#> 63241 samples
#> 3 predictor
#> 2 classes: 'Blue_Tarp', 'Other'
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 56917, 56918, 56917, 56916, 56917, 56917, ...
#> Resampling results across tuning parameters:
#>
#> degree scale C Accuracy Kappa
#> 1 0.001 0.25 NaN NaN
#> 1 0.001 0.50 NaN NaN
#> 1 0.001 1.00 0.9882988 0.8223133
#> 1 0.010 0.25 0.9887574 0.8262485
#> 1 0.010 0.50 0.9894215 0.8319730
#> 1 0.010 1.00 0.9914614 0.8610221
#> 1 0.100 0.25 0.9950033 0.9142107
#> 1 0.100 0.50 0.9950981 0.9161479
#> 1 0.100 1.00 0.9951772 0.9182245
#> 2 0.001 0.25 NaN NaN
#> 2 0.001 0.50 0.9883462 0.8234671
#> 2 0.001 1.00 0.9883779 0.8235954
#> 2 0.010 0.25 0.9951140 0.9170950
#> 2 0.010 0.50 0.9954460 0.9244588
#> 2 0.010 1.00 0.9959520 0.9339469
#> 2 0.100 0.25 0.9957623 0.9305018
#> 2 0.100 0.50 0.9958571 0.9322602
#> 2 0.100 1.00 0.9957939 0.9310295
#> 3 0.001 0.25 NaN NaN
#> 3 0.001 0.50 0.9885360 0.8264737
#> 3 0.001 1.00 0.9894689 0.8351829
#> 3 0.010 0.25 0.9953353 0.9221985
#> 3 0.010 0.50 0.9958888 0.9330976
#> 3 0.010 1.00 0.9958413 0.9319878
#> 3 0.100 0.25 0.9960311 0.9354391
#> 3 0.100 0.50 0.9959836 0.9346086
#> 3 0.100 1.00 0.9959046 0.9333758
#>
#> Accuracy was used to select the optimal model using the largest value.
#> The final values used for the model were degree = 3, scale = 0.1 and C = 0.25.
With a polynomial SVM, the tuning parameters that are selected are degree = 3, scale = 0.1 and Cost = 0.25 since this resulted in the highest accuracy.
library(kernlab)
set.seed(1)
SVMModelRadial = train(Class ~ Blue + Red + Green, data=Haiti,
method='svmRadial',
family='binomial',
metric='ROC',
trControl=trControl,
)
SVMModelRadial
#> Support Vector Machines with Radial Basis Function Kernel
#>
#> 63241 samples
#> 3 predictor
#> 2 classes: 'Blue_Tarp', 'Other'
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 56917, 56918, 56917, 56916, 56917, 56917, ...
#> Resampling results across tuning parameters:
#>
#> C Accuracy Kappa
#> 0.25 0.9967268 0.9461569
#> 0.50 0.9968533 0.9482746
#> 1.00 0.9968692 0.9485949
#>
#> Tuning parameter 'sigma' was held constant at a value of 8.297721
#> Accuracy was used to select the optimal model using the largest value.
#> The final values used for the model were sigma = 8.297721 and C = 1.
With a radial SVM, the tuning parameters that are selected are sigma = 8.297721 and Cost = 1 since this resulted in the highest accuracy.
ggplot(SVMModelLinear$results, aes(x=C, y=Accuracy)) +
geom_point() + geom_line()
#> geom_path: Each group consists of only one observation. Do you need to adjust
#> the group aesthetic?
ggplot(SVMModelPoly$results, aes(x=degree, y=Accuracy)) +
geom_point() + geom_line()
ggplot(SVMModelRadial$results, aes(x=sigma, y=Accuracy)) +
geom_point() + geom_line()
SVMLinearResult= thresholder(SVMModelLinear, seq(0,1, by=0.05), final = TRUE, statistics = "all")
plot(SVMLinearResult$prob_threshold, SVMLinearResult$`Pos Pred Value`)
lines(SVMLinearResult$prob_threshold, SVMLinearResult$`Neg Pred Value`)
plot(SVMLinearResult$prob_threshold, SVMLinearResult$Precision)
# using precision and positive predictive value, choose the threshold with the highest value
threshold7 = 0.2
SVMPolyResult= thresholder(SVMModelPoly, seq(0,1, by=0.05), final = TRUE, statistics = "all")
plot(SVMPolyResult$prob_threshold, SVMPolyResult$`Pos Pred Value`)
lines(SVMPolyResult$prob_threshold, SVMPolyResult$`Neg Pred Value`)
plot(SVMPolyResult$prob_threshold, SVMPolyResult$Precision)
# using precision and positive predictive value, choose the threshold with the highest value
threshold8 = 0.1
SVMRadialResult= thresholder(SVMModelRadial, seq(0,1, by=0.05), final = TRUE, statistics = "all")
plot(SVMRadialResult$prob_threshold, SVMRadialResult$`Pos Pred Value`)
lines(SVMRadialResult$prob_threshold, SVMRadialResult$`Neg Pred Value`)
plot(SVMRadialResult$prob_threshold, SVMRadialResult$Precision)
# using precision and positive predictive value, choose the threshold with the highest value
threshold9 = 0.1
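The thresholds chosen above are applied by converting each pixel's predicted P(Blue_Tarp) into a class label; a lower threshold flags more pixels as tarps, trading precision for sensitivity. A minimal sketch with made-up probabilities:

```r
# Turning predicted probabilities into class labels at a chosen threshold.
probs = c(0.05, 0.12, 0.45, 0.93)  # made-up P(Blue_Tarp) values
threshold = 0.1
pred_class = ifelse(probs > threshold, "Blue_Tarp", "Other")
pred_class  # only the first pixel stays "Other" at this low threshold
```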
cm7 = calculateConfusionMatrix(SVMModelLinear, Haiti, threshold7)
cm7$overall['Accuracy']
#> Accuracy
#> 0.9957622
cm7$byClass
#> Sensitivity Specificity Pos Pred Value
#> 0.92680514 0.99803982 0.93981946
#> Neg Pred Value Precision Recall
#> 0.99758356 0.93981946 0.92680514
#> F1 Prevalence Detection Rate
#> 0.93326693 0.03197293 0.02963268
#> Detection Prevalence Balanced Accuracy
#> 0.03153018 0.96242248
The performance here is very high in terms of accuracy and specificity, while the precision is slightly lower but still high at 0.94.
cm8 = calculateConfusionMatrix(SVMModelPoly, Haiti, threshold8)
cm8$overall['Accuracy']
#> Accuracy
#> 0.9955567
cm8$byClass
#> Sensitivity Specificity Pos Pred Value
#> 0.97972305 0.99607965 0.89194057
#> Neg Pred Value Precision Recall
#> 0.99932809 0.89194057 0.97972305
#> F1 Prevalence Detection Rate
#> 0.93377327 0.03197293 0.03132462
#> Detection Prevalence Balanced Accuracy
#> 0.03511962 0.98790135
This performance seems slightly better, especially with a very high sensitivity of 0.98.
cm9 = calculateConfusionMatrix(SVMModelRadial, Haiti, threshold9)
cm9$overall['Accuracy']
#> Accuracy
#> 0.9972012
cm9$byClass
#> Sensitivity Specificity Pos Pred Value
#> 0.96389713 0.99830118 0.94934243
#> Neg Pred Value Precision Recall
#> 0.99880696 0.94934243 0.96389713
#> F1 Prevalence Detection Rate
#> 0.95656442 0.03197293 0.03081861
#> Detection Prevalence Balanced Accuracy
#> 0.03246312 0.98109916
The results here are also strong; the sensitivity (0.96) is higher than the linear SVM's but not as high as the polynomial SVM's (0.98).
library(ROCR)
pred7 = predict(SVMModelLinear,Haiti, type = "prob")
pred7 = prediction(pred7$Other, Haiti$Class)
roc7 = performance(pred7,"tpr","fpr")
plot(roc7, lwd = 2)
abline(a = 0, b = 1)
auc7 = performance(pred7, measure = "auc")
auc7@y.values[[1]]
#> [1] 0.9978485
This AUC value is very high, so it seems like it performs well in classifying between the two groups.
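The AUC can be read as the probability that a randomly chosen blue-tarp pixel receives a higher score than a randomly chosen non-tarp pixel, and it can be computed directly from ranks as a sanity check on ROCR. The scores below are made up for illustration:

```r
# AUC as the Mann-Whitney statistic:
# P(score of a random positive > score of a random negative).
auc_rank = function(scores, labels) {
  r = rank(scores)
  n_pos = sum(labels == 1)
  n_neg = sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

scores = c(0.9, 0.8, 0.3, 0.2, 0.7, 0.1)  # made-up classifier scores
labels = c(1, 1, 0, 0, 1, 0)              # 1 = Blue_Tarp, 0 = Other
auc_rank(scores, labels)  # 1: every positive outranks every negative
```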
library(ROCR)
pred8 = predict(SVMModelPoly,Haiti, type = "prob")
pred8 = prediction(pred8$Other, Haiti$Class)
roc8 = performance(pred8,"tpr","fpr")
plot(roc8, lwd = 2)
abline(a = 0, b = 1)
auc8 = performance(pred8, measure = "auc")
auc8@y.values[[1]]
#> [1] 0.9995656
This AUC value is even higher than what was seen before with SVM linear.
library(ROCR)
pred9 = predict(SVMModelRadial,Haiti, type = "prob")
pred9 = prediction(pred9$Other, Haiti$Class)
roc9 = performance(pred9,"tpr","fpr")
plot(roc9, lwd = 2)
abline(a = 0, b = 1)
auc9 = performance(pred9, measure = "auc")
auc9@y.values[[1]]
#> [1] 0.9967238
This AUC value is also very high, but not as high as those seen with the polynomial and linear SVMs.
At this point, the training data has helped determine the best-performing model and has identified the tuning parameters and thresholds that should be used for each model. Now, I will test the performance of these models against the hold-out data.
When attempting to understand how B1, B2, and B3 correspond to Red, Green, and Blue, I used https://www.rapidtables.com/web/color/RGB_Color.html. Using the blue-tarp data, I plugged the values from a few observations into the tool to see which combination would result in a blue color. From this experimentation, I was able to determine that B1 = Red, B2 = Green, and B3 = Blue. Additionally, I will create histograms to confirm this: since I know that data2 contains only blue tarps, I'll create histograms of its three band columns, which helps confirm that V8 = Red, V9 = Green, and V10 = Blue (the names the bands receive when read in). First, I will load in the data and then create the histograms.
When reading in the files, it is important to read in all seven; one additional file was removed because it duplicated another. Then the extra header lines need to be stripped from each file, and the unnecessary columns dropped as well. Next, V8, V9, and V10 need to be relabeled as Red, Green, and Blue, respectively. Finally, the class needs to be labeled as either blue tarp or other, which is straightforward because the text files arrived labeled as Blue Tarp or NON Blue Tarp.
Once all the files are read in, then they all need to be merged to form one dataset that can be used as the hold out data.
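Since the same drop/rename/relabel steps repeat for each of the seven files, they can be wrapped in a small helper; the function name below is mine, not part of the original code:

```r
# Clean one raw ROI data frame: keep V1 and the three band columns,
# rename them, and set the class label.
clean_roi = function(df, class_label) {
  df = df[-c(2:7)]  # drop the unneeded columns, keeping V1, V8, V9, V10
  names(df)[names(df) == 'V1'] = "Class"
  names(df)[names(df) == 'V8'] = "Red"
  names(df)[names(df) == 'V9'] = "Green"
  names(df)[names(df) == 'V10'] = "Blue"
  df$Class = class_label  # "Blue_Tarp" or "Other"
  df
}

# e.g. data1 = clean_roi(read.fwf(path, widths, skip = 8), "Other")
```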
https://stackoverflow.com/questions/34000789/import-txt-file-in-r-ignoring-first-few-lines
#Reading in first one
line1 = readLines("~/Desktop/DS 6030/1orthovnir057_ROI_NON_Blue_Tarps.txt", n = 9)[9]
w1 = diff(c(0, gregexpr("\\S(?=\\s)", paste(line1, ""), perl = TRUE)[[1]]))
data1 = read.fwf("~/Desktop/DS 6030/1orthovnir057_ROI_NON_Blue_Tarps.txt", w1, skip = 8)
https://www.listendata.com/2015/06/r-keep-drop-columns-from-data-frame.html
data1 = data1[ -c(2:7) ]
https://www.marsja.se/how-to-rename-column-or-columns-in-r-with-dplyr/
names(data1)[names(data1) == 'V1'] = "Class"
names(data1)[names(data1) == 'V8'] = "Red"
names(data1)[names(data1) == 'V9'] = "Green"
names(data1)[names(data1) == 'V10'] = "Blue"
data1$Class = rep(c("Other"))
head(data1)
#> Class Red Green Blue
#> 1 Other 104 89 63
#> 2 Other 101 80 60
#> 3 Other 103 87 69
#> 4 Other 107 93 72
#> 5 Other 109 99 68
#> 6 Other 103 73 53
https://stackoverflow.com/questions/34000789/import-txt-file-in-r-ignoring-first-few-lines
#Reading in second one
line2 = readLines("~/Desktop/DS 6030/2orthovnir067_ROI_Blue_Tarps.txt", n = 9)[9]
w2 = diff(c(0, gregexpr("\\S(?=\\s)", paste(line2, ""), perl = TRUE)[[1]]))
data2 = read.fwf("~/Desktop/DS 6030/2orthovnir067_ROI_Blue_Tarps.txt", w2, skip = 8)
https://www.listendata.com/2015/06/r-keep-drop-columns-from-data-frame.html
data2 = data2[ -c(2:7) ]
https://www.marsja.se/how-to-rename-column-or-columns-in-r-with-dplyr/
names(data2)[names(data2) == 'V1'] = "Class"
names(data2)[names(data2) == 'V8'] = "Red"
names(data2)[names(data2) == 'V9'] = "Green"
names(data2)[names(data2) == 'V10'] = "Blue"
data2$Class = rep(c("Blue_Tarp"))
head(data2)
#> Class Red Green Blue
#> 1 Blue_Tarp 77 94 99
#> 2 Blue_Tarp 79 98 103
#> 3 Blue_Tarp 77 99 102
#> 4 Blue_Tarp 76 94 99
#> 5 Blue_Tarp 80 95 103
#> 6 Blue_Tarp 80 96 107
https://stackoverflow.com/questions/34000789/import-txt-file-in-r-ignoring-first-few-lines
#Reading in third one
line3 = readLines("~/Desktop/DS 6030/3orthovnir067_ROI_NOT_Blue_Tarps.txt", n = 9)[9]
w3 = diff(c(0, gregexpr("\\S(?=\\s)", paste(line3, ""), perl = TRUE)[[1]]))
data3 = read.fwf("~/Desktop/DS 6030/3orthovnir067_ROI_NOT_Blue_Tarps.txt", w3, skip = 8)
https://www.listendata.com/2015/06/r-keep-drop-columns-from-data-frame.html
data3 = data3[ -c(2:7) ]
https://www.marsja.se/how-to-rename-column-or-columns-in-r-with-dplyr/
names(data3)[names(data3) == 'V1'] = "Class"
names(data3)[names(data3) == 'V8'] = "Red"
names(data3)[names(data3) == 'V9'] = "Green"
names(data3)[names(data3) == 'V10'] = "Blue"
data3$Class = rep(c("Other"))
head(data3)
#> Class Red Green Blue
#> 1 Other 68 62 58
#> 2 Other 65 62 57
#> 3 Other 65 62 57
#> 4 Other 64 63 57
#> 5 Other 72 64 59
#> 6 Other 72 66 59
https://stackoverflow.com/questions/34000789/import-txt-file-in-r-ignoring-first-few-lines
#Reading in fourth one
line4 = readLines("~/Desktop/DS 6030/4orthovnir069_ROI_Blue_Tarps.txt", n = 9)[9]
w4 = diff(c(0, gregexpr("\\S(?=\\s)", paste(line4, ""), perl = TRUE)[[1]]))
data4 = read.fwf("~/Desktop/DS 6030/4orthovnir069_ROI_Blue_Tarps.txt", w4, skip = 8)
https://www.listendata.com/2015/06/r-keep-drop-columns-from-data-frame.html
data4 = data4[ -c(2:7) ]
https://www.marsja.se/how-to-rename-column-or-columns-in-r-with-dplyr/
names(data4)[names(data4) == 'V1'] = "Class"
names(data4)[names(data4) == 'V8'] = "Red"
names(data4)[names(data4) == 'V9'] = "Green"
names(data4)[names(data4) == 'V10'] = "Blue"
# Make all values in Class = "Blue_Tarp"
data4$Class = rep(c("Blue_Tarp"))
head(data4)
#> Class Red Green Blue
#> 1 Blue_Tarp 88 102 123
#> 2 Blue_Tarp 88 103 126
#> 3 Blue_Tarp 88 103 126
#> 4 Blue_Tarp 89 103 125
#> 5 Blue_Tarp 90 104 126
#> 6 Blue_Tarp 88 104 127
https://stackoverflow.com/questions/34000789/import-txt-file-in-r-ignoring-first-few-lines
#Reading in fifth one
line5 = readLines("~/Desktop/DS 6030/5orthovnir069_ROI_NOT_Blue_Tarps.txt", n = 9)[9]
w5 = diff(c(0, gregexpr("\\S(?=\\s)", paste(line5, ""), perl = TRUE)[[1]]))
data5 = read.fwf("~/Desktop/DS 6030/5orthovnir069_ROI_NOT_Blue_Tarps.txt", w5, skip = 8)
https://www.listendata.com/2015/06/r-keep-drop-columns-from-data-frame.html
data5 = data5[ -c(2:7) ]
https://www.marsja.se/how-to-rename-column-or-columns-in-r-with-dplyr/
names(data5)[names(data5) == 'V1'] = "Class"
names(data5)[names(data5) == 'V8'] = "Red"
names(data5)[names(data5) == 'V9'] = "Green"
names(data5)[names(data5) == 'V10'] = "Blue"
data5$Class = rep(c("Other"))
head(data5)
#> Class Red Green Blue
#> 1 Other 116 138 120
#> 2 Other 104 121 109
#> 3 Other 106 125 109
#> 4 Other 114 135 109
#> 5 Other 111 133 107
#> 6 Other 107 122 103
https://stackoverflow.com/questions/34000789/import-txt-file-in-r-ignoring-first-few-lines
#Reading in sixth one
line6 = readLines("~/Desktop/DS 6030/6orthovnir078_ROI_Blue_Tarps.txt", n = 9)[9]
w6 = diff(c(0, gregexpr("\\S(?=\\s)", paste(line6, ""), perl = TRUE)[[1]]))
data6 = read.fwf("~/Desktop/DS 6030/6orthovnir078_ROI_Blue_Tarps.txt", w6, skip = 8)
https://www.listendata.com/2015/06/r-keep-drop-columns-from-data-frame.html
data6 = data6[ -c(2:7) ]
https://www.marsja.se/how-to-rename-column-or-columns-in-r-with-dplyr/
names(data6)[names(data6) == 'V1'] = "Class"
names(data6)[names(data6) == 'V8'] = "Red"
names(data6)[names(data6) == 'V9'] = "Green"
names(data6)[names(data6) == 'V10'] = "Blue"
data6$Class = rep(c("Blue_Tarp"))
head(data6)
#> Class Red Green Blue
#> 1 Blue_Tarp 91 100 132
#> 2 Blue_Tarp 88 108 140
#> 3 Blue_Tarp 85 100 130
#> 4 Blue_Tarp 88 101 135
#> 5 Blue_Tarp 91 104 141
#> 6 Blue_Tarp 94 113 147
https://stackoverflow.com/questions/34000789/import-txt-file-in-r-ignoring-first-few-lines
#Reading in seventh one
line7 = readLines("~/Desktop/DS 6030/7orthovnir078_ROI_NON_Blue_Tarps.txt", n = 9)[9]
w7 = diff(c(0, gregexpr("\\S(?=\\s)", paste(line7, ""), perl = TRUE)[[1]]))
data7 = read.fwf("~/Desktop/DS 6030/7orthovnir078_ROI_NON_Blue_Tarps.txt", w7, skip = 8)
https://www.listendata.com/2015/06/r-keep-drop-columns-from-data-frame.html
data7 = data7[ -c(2:7) ]
https://www.marsja.se/how-to-rename-column-or-columns-in-r-with-dplyr/
names(data7)[names(data7) == 'V1'] = "Class"
names(data7)[names(data7) == 'V8'] = "Red"
names(data7)[names(data7) == 'V9'] = "Green"
names(data7)[names(data7) == 'V10'] = "Blue"
data7$Class = rep(c("Other"))
head(data7)
#> Class Red Green Blue
#> 1 Other 83 75 65
#> 2 Other 81 75 65
#> 3 Other 84 78 66
#> 4 Other 81 78 65
#> 5 Other 80 79 65
#> 6 Other 79 76 64
https://www.programmingr.com/examples/r-dataframe/merge-data-frames/
final = rbind(data1, data2, data3, data4, data5, data6, data7)
final$Class= factor(final$Class)
levels(final$Class)
#> [1] "Blue_Tarp" "Other"
contrasts(final$Class)
#> Other
#> Blue_Tarp 0
#> Other 1
head(final)
#> Class Red Green Blue
#> 1 Other 104 89 63
#> 2 Other 101 80 60
#> 3 Other 103 87 69
#> 4 Other 107 93 72
#> 5 Other 109 99 68
#> 6 Other 103 73 53
line2b = readLines("~/Desktop/DS 6030/2orthovnir067_ROI_Blue_Tarps.txt", n = 9)[9]
w2b = diff(c(0, gregexpr("\\S(?=\\s)", paste(line2b, ""), perl = TRUE)[[1]]))
data2b = read.fwf("~/Desktop/DS 6030/2orthovnir067_ROI_Blue_Tarps.txt", w2b, skip = 8)
# Removing all other columns except first one and last three
#https://www.listendata.com/2015/06/r-keep-drop-columns-from-data-frame.html
data2b = data2b[ -c(2:7) ]
hist(data2b$V8)
hist(data2b$V9)
hist(data2b$V10)
#EDA
# Box Plots
par(mfrow=c(1,3))
boxplot(final$Red~final$Class, main = "Boxplot of Red against Classes")
boxplot(final$Green~final$Class, main = "Boxplot of Green against Classes")
boxplot(final$Blue~final$Class, main = "Boxplot of Blue against Classes")
Regarding the box plots, we see similar results to what we saw with the training data set. Red and Green tend to overlap more between the Blue Tarp class and the other class. On the other hand, there is a clear distinction between Blue Tarps and Others in terms of the blue predictor variable.
#summary stats of variables
plot(final$Class)
summary(final$Red)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 27.0 76.0 107.0 118.3 139.0 255.0
summary(final$Green)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 28.0 71.0 91.0 105.4 117.0 255.0
summary(final$Blue)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 25.00 55.00 66.00 82.36 88.00 255.00
This is also similar to what was shown with the training data set with many more observations of the other class as compared to the Blue Tarp class.
As explained above, I will now use the tuning parameters and thresholds established for each model and see how the performance metrics on the hold-out data compare to the training data. This hold-out data does not need to be split into training and test sets, because each model will be evaluated against the entire hold-out set.
# Model Training
# Set up for all Models
library(tidyverse)
library(glmnet)
library(caret)
set.seed(1)
# Set up for all Confusion Matrix and Performance Metrics
calculateConfusionMatrix = function(model, data, threshold) {
  prediction = predict(model, data, type='prob')$Blue_Tarp
  predClass = as.factor(ifelse(prediction > threshold, 'Blue_Tarp', 'Other'))
  return(confusionMatrix(data=predClass, reference=data$Class))
}
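As a sanity check on the caret output, the key metrics can be computed directly from the raw 2x2 counts; the counts below are made up for illustration:

```r
# Metrics from a 2x2 confusion matrix, with Blue_Tarp as the positive class.
tp = 90; fn = 10   # actual blue tarps: found / missed
fp = 30; tn = 870  # actual others: falsely flagged / correctly ignored

sensitivity = tp / (tp + fn)                   # TPR: tarps we find
specificity = tn / (tn + fp)                   # non-tarps correctly ignored
precision   = tp / (tp + fp)                   # flagged pixels that really are tarps
accuracy    = (tp + tn) / (tp + fn + fp + tn)

round(c(sensitivity, specificity, precision, accuracy), 3)
```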
cmfinal = calculateConfusionMatrix(LogRegModel, final, threshold)
cmfinal$overall['Accuracy']
#> Accuracy
#> 0.976081
cmfinal$byClass
#> Sensitivity Specificity Pos Pred Value
#> 0.992058011 0.975964682 0.230993118
#> Neg Pred Value Precision Recall
#> 0.999940782 0.230993118 0.992058011
#> F1 Prevalence Detection Rate
#> 0.374732613 0.007224911 0.007167531
#> Detection Prevalence Balanced Accuracy
#> 0.031029196 0.984011347
The values here for accuracy, sensitivity, and specificity all seem high, but precision is very low at 0.23.
library(ROCR)
pred1f = predict(LogRegModel,final, type = "prob")
pred1f = prediction(pred1f$Other, final$Class)
roc1f = performance(pred1f,"tpr","fpr")
plot(roc1f, lwd = 2)
abline(a = 0, b = 1)
auc1f = performance(pred1f, measure = "auc")
auc1f@y.values[[1]]
#> [1] 0.9994131
The AUC value is very high, so this algorithm seems to work very well against the hold-out data.
cmfinal2 = calculateConfusionMatrix(LDAModel, final, threshold2)
cmfinal2$overall['Accuracy']
#> Accuracy
#> 0.9840209
cmfinal2$byClass
#> Sensitivity Specificity Pos Pred Value
#> 0.722928177 0.985920972 0.272030353
#> Neg Pred Value Precision Recall
#> 0.997958993 0.272030353 0.722928177
#> F1 Prevalence Detection Rate
#> 0.395309756 0.007224911 0.005223092
#> Detection Prevalence Balanced Accuracy
#> 0.019200400 0.854424574
With LDA, the sensitivity is not high at 0.72, and the precision also seems low at 0.27.
library(ROCR)
pred2f = predict(LDAModel,final, type = "prob")
pred2f = prediction(pred2f$Other, final$Class)
roc2f = performance(pred2f,"tpr","fpr")
plot(roc2f, lwd = 2)
abline(a = 0, b = 1)
auc2f = performance(pred2f, measure = "auc")
auc2f@y.values[[1]]
#> [1] 0.9921155
Even though a couple of values in the performance table above were not as strong, the AUC value is very high.
cmfinal3 = calculateConfusionMatrix(QDAModel, final, threshold3)
cmfinal3$overall['Accuracy']
#> Accuracy
#> 0.9958068
cmfinal3$byClass
#> Sensitivity Specificity Pos Pred Value
#> 0.715262431 0.997848416 0.707542014
#> Neg Pred Value Precision Recall
#> 0.997927661 0.707542014 0.715262431
#> F1 Prevalence Detection Rate
#> 0.711381276 0.007224911 0.005167707
#> Detection Prevalence Balanced Accuracy
#> 0.007303746 0.856555424
QDA performs slightly better, but the sensitivity (0.72) and precision (0.71) remain modest. However, the accuracy and specificity are both very high at around 0.99.
library(ROCR)
pred3f = predict(QDAModel,final, type = "prob")
pred3f = prediction(pred3f$Other, final$Class)
roc3f = performance(pred3f,"tpr","fpr")
plot(roc3f, lwd = 2)
abline(a = 0, b = 1)
auc3f = performance(pred3f, measure = "auc")
auc3f@y.values[[1]]
#> [1] 0.9915001
The AUC value here is also high at 0.99.
cmfinal4 = calculateConfusionMatrix(KNNModel, final, threshold4)
cmfinal4$overall['Accuracy']
#> Accuracy
#> 0.9950269
cmfinal4$byClass
#> Sensitivity Specificity Pos Pred Value
#> 0.681560773 0.997308133 0.648210181
#> Neg Pred Value Precision Recall
#> 0.997681694 0.648210181 0.681560773
#> F1 Prevalence Detection Rate
#> 0.664467261 0.007224911 0.004924216
#> Detection Prevalence Balanced Accuracy
#> 0.007596634 0.839434453
The accuracy and specificity are high; however, the sensitivity (0.68) and precision (0.65) seem low.
library(ROCR)
pred4f = predict(KNNModel,final, type = "prob")
pred4f = prediction(pred4f$Other, final$Class)
roc4f = performance(pred4f,"tpr","fpr")
plot(roc4f, lwd = 2)
abline(a = 0, b = 1)
auc4f = performance(pred4f, measure = "auc")
auc4f@y.values[[1]]
#> [1] 0.9494872
This value is still high at 0.95 but not as high as what was seen previously.
cmfinal5 = calculateConfusionMatrix(PenLogRegModel, final, threshold5)
cmfinal5$overall['Accuracy']
#> Accuracy
#> 0.9958013
cmfinal5$byClass
#> Sensitivity Specificity Pos Pred Value
#> 0.427900552 0.999934161 0.979295085
#> Neg Pred Value Precision Recall
#> 0.995853543 0.979295085 0.427900552
#> F1 Prevalence Detection Rate
#> 0.595568799 0.007224911 0.003091543
#> Detection Prevalence Balanced Accuracy
#> 0.003156907 0.713917357
This model seems to perform well overall, except the sensitivity is concerning at 0.43.
library(ROCR)
pred5f = predict(PenLogRegModel,final, type = "prob")
pred5f = prediction(pred5f$Other, final$Class)
roc5f = performance(pred5f,"tpr","fpr")
plot(roc5f, lwd = 2)
abline(a = 0, b = 1)
auc5f = performance(pred5f, measure = "auc")
auc5f@y.values[[1]]
#> [1] 0.987519
This model's AUC value also indicates strong overall performance.
cmfinal6 = calculateConfusionMatrix(RFModel, final, threshold6)
cmfinal6$overall['Accuracy']
#> Accuracy
#> 0.994965
cmfinal6$byClass
#> Sensitivity Specificity Pos Pred Value
#> 0.417886740 0.999164697 0.784519642
#> Neg Pred Value Precision Recall
#> 0.995778036 0.784519642 0.417886740
#> F1 Prevalence Detection Rate
#> 0.545307079 0.007224911 0.003019194
#> Detection Prevalence Balanced Accuracy
#> 0.003848462 0.708525719
This model has very high accuracy and high specificity, but the sensitivity is very low at about 0.42, with precision at 0.78.
library(ROCR)
pred6f = predict(RFModel,final, type = "prob")
pred6f = prediction(pred6f$Other, final$Class)
roc6f = performance(pred6f,"tpr","fpr")
plot(roc6f, lwd = 2)
abline(a = 0, b = 1)
auc6f = performance(pred6f, measure = "auc")
auc6f@y.values[[1]]
#> [1] 0.9777741
The AUC value is similar to others at 0.98.
cmfinal7 = calculateConfusionMatrix(SVMModelLinear, final, threshold7)
cmfinal7$overall['Accuracy']
#> Accuracy
#> 0.9472711
cmfinal7$byClass
#> Sensitivity Specificity Pos Pred Value
#> 0.993024862 0.946938152 0.119868952
#> Neg Pred Value Precision Recall
#> 0.999946397 0.119868952 0.993024862
#> F1 Prevalence Detection Rate
#> 0.213915915 0.007224911 0.007174516
#> Detection Prevalence Balanced Accuracy
#> 0.059852997 0.969981507
The accuracy, sensitivity, and specificity all seem to perform well with this model, but the precision is much lower than was seen with other models.
library(ROCR)
pred7f = predict(SVMModelLinear,final, type = "prob")
pred7f = prediction(pred7f$Other, final$Class)
roc7f = performance(pred7f,"tpr","fpr")
plot(roc7f, lwd = 2)
abline(a = 0, b = 1)
auc7f = performance(pred7f, measure = "auc")
auc7f@y.values[[1]]
#> [1] 0.999092
This AUC value is extremely high at 0.999, which shows that this algorithm classifies well against the hold out data.
cmfinal8 = calculateConfusionMatrix(SVMModelPoly, final, threshold8)
cmfinal8$overall['Accuracy']
#> Accuracy
#> 0.9836412
cmfinal8$byClass
#> Sensitivity Specificity Pos Pred Value
#> 0.993577348 0.983568855 0.305586236
#> Neg Pred Value Precision Recall
#> 0.999952481 0.305586236 0.993577348
#> F1 Prevalence Detection Rate
#> 0.467413905 0.007224911 0.007178508
#> Detection Prevalence Balanced Accuracy
#> 0.023490939 0.988573102
This model seems to perform well overall, except the precision is low at 0.31.
library(ROCR)
pred8f = predict(SVMModelPoly,final, type = "prob")
pred8f = prediction(pred8f$Other, final$Class)
roc8f = performance(pred8f,"tpr","fpr")
plot(roc8f, lwd = 2)
abline(a = 0, b = 1)
auc8f = performance(pred8f, measure = "auc")
auc8f@y.values[[1]]
#> [1] 0.9994998
This AUC value is also very high at 0.9995.
cmfinal9 = calculateConfusionMatrix(SVMModelRadial, final, threshold9)
cmfinal9$overall['Accuracy']
#> Accuracy
#> 0.9886956
cmfinal9$byClass
#> Sensitivity Specificity Pos Pred Value
#> 0.609392265 0.991455986 0.341697646
#> Neg Pred Value Precision Recall
#> 0.997141056 0.341697646 0.609392265
#> F1 Prevalence Detection Rate
#> 0.437872171 0.007224911 0.004402805
#> Detection Prevalence Balanced Accuracy
#> 0.012885089 0.800424125
This model is similar to others in that the accuracy and specificity are high, but the sensitivity (0.61) and precision (0.34) are low.
library(ROCR)
pred9f = predict(SVMModelRadial,final, type = "prob")
pred9f = prediction(pred9f$Other, final$Class)
roc9f = performance(pred9f,"tpr","fpr")
plot(roc9f, lwd = 2)
abline(a = 0, b = 1)
auc9f = performance(pred9f, measure = "auc")
auc9f@y.values[[1]]
#> [1] 0.9890527
This AUC value also indicates good overall performance.
The true positive rate is the sensitivity, and the false positive rate is 1 - specificity. https://algolytics.com/how-to-assess-quality-and-correctness-of-classification-models-part-4-roc-curve/
# cite: https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html#Basic_HTML_table
#install.packages("kableExtra")
library(kableExtra)
#>
#> Attaching package: 'kableExtra'
#> The following object is masked from 'package:dplyr':
#>
#> group_rows
Disaster_Project1 = read.csv("Training Results.csv")
Disaster_Project1 %>%
kbl() %>%
kable_styling()
| Model | Tuning | AUROC | Threshold | Accuracy | TPR | FPR | Precision |
|---|---|---|---|---|---|---|---|
| Log Reg | n/a | 0.9985069 | 0.3 | 0.9952879 | 0.9104847 | 0.0015355 | 0.9514212 |
| LDA | n/a | 0.9888768 | 0.9 | 0.9846302 | 0.7418398 | 0.0073507 | 0.7692308 |
| QDA | n/a | 0.9982175 | 0.4 | 0.9947344 | 0.8476756 | 0.0004084 | 0.9856239 |
| KNN | 5 | 0.9998582 | 0.9 | 0.9961892 | 0.8808111 | 0.0000000 | 1.0000000 |
| Penalized Log Reg | alpha = 0 and lambda = 0.003981072 | 0.9801954 | 0.3 | 0.9848990 | 0.5276954 | 0.0000000 | 1.0000000 |
| Random Forest | mtry = 2 | 0.9944983 | 0.9 | 0.9971696 | 0.9114738 | 0.0000000 | 1.0000000 |
| SVM - Linear | Cost = 1 | 0.9978485 | 0.2 | 0.9957622 | 0.9268051 | 0.0019602 | 0.9398195 |
| SVM - Polynomial | degree = 3, scale = 0.1 and Cost = 0.25. | 0.9995656 | 0.1 | 0.9955567 | 0.9797231 | 0.0039204 | 0.8919406 |
| SVM - Radial | sigma = 8.297721 and Cost = 1 | 0.9967238 | 0.1 | 0.9972012 | 0.9638971 | 0.0016988 | 0.9493424 |
# cite: https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html#Basic_HTML_table
#install.packages("kableExtra")
library(kableExtra)
Disaster_Project2 = read.csv("Hold Out Results.csv")
Disaster_Project2 %>%
kbl() %>%
kable_styling()
| Model | Tuning | AUROC | Threshold | Accuracy | TPR | FPR | Precision |
|---|---|---|---|---|---|---|---|
| Log Reg | n/a | 0.9994131 | 0.3 | 0.9760810 | 0.9920580 | 0.0240353 | 0.2309931 |
| LDA | n/a | 0.9921155 | 0.9 | 0.9840209 | 0.7229282 | 0.0140790 | 0.2720304 |
| QDA | n/a | 0.9915001 | 0.4 | 0.9958068 | 0.7152624 | 0.0021516 | 0.7075420 |
| KNN | k = 5 | 0.9494872 | 0.9 | 0.9950269 | 0.6815608 | 0.0026919 | 0.6482102 |
| Penalized Log Reg | alpha = 0 and lambda = 0.003981072 | 0.9875190 | 0.3 | 0.9958013 | 0.4279006 | 0.0000658 | 0.9792951 |
| Random Forest | mtry = 2 | 0.9777741 | 0.9 | 0.9949650 | 0.4178867 | 0.0008353 | 0.7845196 |
| SVM - Linear | Cost = 1 | 0.9990920 | 0.2 | 0.9472711 | 0.9930249 | 0.0530618 | 0.1198690 |
| SVM - Polynomial | degree = 3, scale = 0.1 and Cost = 0.25. | 0.9994998 | 0.1 | 0.9836412 | 0.9935773 | 0.0164311 | 0.3055862 |
| SVM - Radial | sigma = 8.297721 and Cost = 1 | 0.9890527 | 0.1 | 0.9886956 | 0.6093923 | 0.0085440 | 0.3416976 |
To find the best model in cross-validation as well as on the hold-out data, I compared AUROC values, precision, true positive rates, false positive rates, and accuracy across all algorithms. In the training cross-validation, the highest accuracy, 0.9972, is shared by the SVM with a radial kernel (sigma = 8.297721, cost = 1) and random forest (mtry = 2); that said, every model had very strong accuracy, with the lowest value, 0.985, belonging to LDA and penalized logistic regression (alpha = 0, lambda = 0.00398). The true positive rate differentiates the algorithms much more clearly: the highest is 0.98 for the SVM with a polynomial kernel, with SVM-Radial also very high at 0.964, while the remaining rates are noticeably lower and penalized logistic regression has the lowest at 0.528. For false positive rates, KNN, penalized logistic regression, and random forest all achieved a rate of 0, and every model stayed very low, the highest being LDA's 0.007. For precision, KNN, penalized logistic regression, and random forest were all perfect at 1, and every model performed well except LDA, which had the lowest precision at 0.769. Finally, all AUROC values were very high; LDA and penalized logistic regression were the weakest by comparison, but even those values are still around 0.98.
So, with the cross-validation set, the worst-performing model seems to be LDA, since it had some of the weakest values across these metrics. As for the best-performing model, KNN comes out on top overall: KNN has the highest AUROC, SVM-Radial has the highest accuracy, and SVM-Polynomial has the highest true positive rate, but KNN, penalized logistic regression, and random forest all achieved a false positive rate of 0 and perfect precision.
Therefore, KNN performs the best overall in AUROC, FPR, and precision, while its accuracy is also high at 0.996. The only slightly concerning metric is its true positive rate of 0.88, though that is still a high value.
To find the best algorithm on the hold-out data, I use the same strategy as for the cross-validation. In terms of accuracy, the worst-performing algorithm is SVM-Linear (cost = 1) at 0.947; no model's accuracy was that low on the training data. The highest hold-out accuracies, around 0.996, belong to QDA, KNN (k = 5), and penalized logistic regression. The true positive rates are also much lower on the hold-out set, with the lowest, around 0.42, coming from penalized logistic regression and random forest, while the highest, around 0.99, come from logistic regression, SVM-Linear, and SVM-Polynomial. The false positive rates all look good, with the best belonging to penalized logistic regression at 0.00007. In terms of precision, none of the models perform well except penalized logistic regression at 0.98; logistic regression, LDA, SVM-Linear, SVM-Polynomial, and SVM-Radial are all under 0.5. Finally, the AUROC values are all very high, with the lowest being KNN's at 0.95.
So, with the hold-out data set, QDA had the highest accuracy, with penalized logistic regression at a very similar value. SVM-Polynomial had the highest true positive rate, penalized logistic regression had the best values for FPR and precision, and SVM-Polynomial performs the best in terms of AUROC.
Therefore, on the hold-out data, penalized logistic regression seems to perform the best. It also had strong results on the training data, with a high AUROC, high accuracy, a false positive rate of 0, and a precision of 1; its only concerning training metric was the true positive rate of 0.53. KNN had the best training performance, and on the hold-out set it still performs well in AUROC, accuracy, and FPR, while its TPR and precision sit around 0.65 to 0.68. Therefore, either of these models should allow us to find more blue tarps and provide relief.
Overall, the results are fairly similar between the training data set and the hold-out data set. All models achieve very high accuracy and AUROC on both, and both data sets produce low false positive rates. However, the true positive rates and especially the precision on the hold-out data are noticeably lower than what I saw with the training data.
However, both the training and hold-out data sets seem to favor the same two models: penalized logistic regression and KNN. This makes sense because the sample sizes for both data sets were fairly large, which supports accurate training and testing; as a result, these models were able to avoid over- or underfitting, which yields fairly similar results between the two sets. The main differences I noted were the drops in TPR and precision. The true positive rate is the rate at which an actual blue tarp is classified as a blue tarp, whereas precision is the fraction of pixels classified as blue tarps that really are blue tarps. It makes sense that the true positive rate is slightly lower on real data, where blue tarps may be less distinguishable because of different shades of blue or different lighting. The drop in precision also makes sense: blue tarps are much rarer in the hold-out data, so even a low false positive rate produces many false alarms relative to the number of true tarps.
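One way to quantify why precision falls so much on the hold-out data: at a fixed TPR and FPR, precision depends on the prevalence of blue tarps (about 0.032 in the training data versus 0.0072 in the hold-out data, per the confusion matrices above). A minimal sketch using the logistic regression hold-out values from the tables:

```r
# precision = TPR * prev / (TPR * prev + FPR * (1 - prev))
precision_at = function(tpr, fpr, prev) {
  tpr * prev / (tpr * prev + fpr * (1 - prev))
}

# Logistic regression hold-out TPR/FPR at the hold-out prevalence:
precision_at(0.9920580, 0.0240353, 0.007224911)  # ~0.23, matching the table
# The same TPR/FPR at the training prevalence of ~0.032:
precision_at(0.9920580, 0.0240353, 0.03197293)   # ~0.58
```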
For this type of problem, it seems that penalized logistic regression or KNN would perform the best at detecting blue tarps. KNN especially seems to perform well because it is well suited to classification problems of this kind: it estimates how likely an observation is to belong to a particular class based on its nearest neighbors, and since pixels of the same class have similar RGB values and the predictors are on a common scale, KNN performs very well. Penalized logistic regression also performs well by shrinking its coefficients through the penalty; even though there are only three predictors, the penalty chosen as the tuning parameter produces strong results on most metrics for both data sets.
The metrics used for this project are AUROC, accuracy, TPR, FPR, and precision. All of these are highly relevant to the problem we are trying to solve, since our goal is to locate as many blue tarps as possible in order to help the displaced individuals most likely sheltering under them. AUROC is the area under the ROC curve, which plots TPR against FPR across all thresholds, so it measures the ability of the classifiers to distinguish between the two classes. This is very important in this context, since the entire goal is to distinguish blue tarps from other material, so AUROC is an important value to consider. Accuracy is also important because this is a time-sensitive project: we need to locate individuals under blue tarps as efficiently and accurately as possible. The true positive rate matters because it measures how likely an actual blue tarp is to be classified as a blue tarp, and the false positive rate matters because it measures how often other material gets labeled as a blue tarp. It is important in this situation to have a high TPR and a low FPR, so that we find the blue tarps without wasting time on false positives or missing shelters as false negatives. I feel that precision is less important than the other metrics but still relevant, because it tells us what fraction of the locations we flag as blue tarps actually are blue tarps, which determines how much effort is wasted in the field.
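All five metrics can be computed from predicted scores and true labels. The sketch below uses hypothetical scores (not the project's results) and computes AUROC via the rank-based Mann-Whitney formula, which equals the probability that a randomly chosen tarp pixel scores higher than a randomly chosen non-tarp pixel:

```r
# Hypothetical classifier scores and labels (1 = blue tarp); illustrative only
scores <- c(0.95, 0.90, 0.80, 0.70, 0.40, 0.30, 0.20, 0.10)
labels <- c(1, 1, 0, 1, 0, 0, 1, 0)

pred <- as.integer(scores >= 0.5)      # classify at a 0.5 threshold
tp <- sum(pred == 1 & labels == 1); fp <- sum(pred == 1 & labels == 0)
fn <- sum(pred == 0 & labels == 1); tn <- sum(pred == 0 & labels == 0)

accuracy  <- (tp + tn) / length(labels)
tpr       <- tp / (tp + fn)
fpr       <- fp / (fp + tn)
precision <- tp / (tp + fp)

# AUROC via the rank (Mann-Whitney) formula
r     <- rank(scores)
n_pos <- sum(labels == 1); n_neg <- sum(labels == 0)
auroc <- (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

In the project itself these quantities came from yardstick, but the definitions are the same.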
Looking at all of the results, though, any of these models would perform well, as AUROC and accuracy are high for every model on both the training and hold-out data sets. As noted previously, AUROC is one of the most important metrics for this type of project, because accurately separating blue tarps from other materials must be done quickly in situations like this. I believe the next best model for this type of situation would be Random Forest. Its values on both the training and hold-out data sets were fairly high across all metrics. Additionally, random forests reduce overfitting by averaging many decorrelated trees, which can improve the accuracy of the model, and they work well with both regression and classification problems and with different types of predictors, such as pixel data. LDA appears to be the only algorithm that is not as well suited for this project.
To improve results for this project, I may want to look into additional metrics to ensure I am considering everything that could help locate displaced individuals under blue tarps. The metrics used here have been very informative, but it may be helpful to include others that further quantify how well these models separate blue tarps from other materials despite the varying shades of blue and lighting. Beyond the metrics, it may also help to include more predictors in the model. For example, the location coordinates could be useful: individuals may stay together, so several blue tarps may sit next to each other rather than spread apart, or it may be the case that grouping is difficult and the tarps are scattered rather than clumped. Using the location coordinates at least for EDA would help us understand where we might find more blue tarps more quickly.
This data seems most well suited to KNN prediction methods because KNN works well for regression and classification problems, assigning a class based on an observation's nearest neighbors. KNN is distance-based, so it works best when the predictors are on comparable scales; in this case all of the predictors are pixel intensities on the same scale, so the distances are meaningful and KNN provides accurate estimates without additional scaling. On many of the metrics, LDA performed the worst for this problem. LDA assumes the classes share a common covariance structure, which yields a linear decision boundary, and it is also often used to reduce the dimensionality of the predictors. With only three pixel-value predictors and a class boundary that may not be linear, this may explain why LDA is not well suited to this data.
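To show why comparable scales matter for a distance-based method, here is a base R sketch with a hypothetical mixed-scale feature set (the second column is invented and is not one of the project's predictors). A predictor measured in larger units dominates Euclidean distance until the columns are standardized; the RGB bands in this project already share one scale, so this step was not needed:

```r
# Hypothetical mixed-scale predictors: a pixel band and an elevation in metres
x <- cbind(red = c(50, 60, 200, 130), elev_m = c(1200, 1900, 1500, 1400))

# scale() centers each column to mean 0 and rescales to sd 1,
# putting both predictors on equal footing for distance calculations
scaled <- scale(x)

colMeans(scaled)        # each column now has mean (approximately) 0
apply(scaled, 2, sd)    # and standard deviation 1
```

After standardizing, a one-unit difference means "one standard deviation" in either predictor, so neither column dominates the neighbor search.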
The tuning parameters and thresholds established with the training data show some similarities. Logistic Regression, QDA, Penalized Logistic Regression, and all SVM models have a threshold below 0.5, while LDA, KNN, and Random Forest all had a high threshold of 0.9. The threshold was chosen by finding what produced the highest precision, i.e., the highest positive predictive value. In this case, however, I am more concerned about false negatives (classifying a blue tarp as not a blue tarp), because I want to make sure I capture all blue tarps and do not miss any; for that goal it usually makes more sense to lower the threshold. Therefore, for future enhancements, I would like to further understand why 0.9 seemed to perform best for those three algorithms. In terms of the tuning parameters, the main differences were among the three SVM models. It was interesting that SVM - Linear and SVM - Radial settled on the same cost tuning parameter while SVM - Polynomial differed, and I would like to understand why.
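The trade-off behind lowering the threshold can be sketched with a simple sweep in base R. The scores below are hypothetical, not the project's fitted probabilities; the point is that as the threshold drops, the false negative rate (missed tarps) falls while more non-tarp pixels get flagged:

```r
# Hypothetical predicted probabilities and true labels (1 = blue tarp)
scores <- c(0.95, 0.85, 0.60, 0.55, 0.35, 0.20, 0.10)
labels <- c(1, 1, 1, 0, 1, 0, 0)

# For each candidate threshold, compute the tarp recall and the miss rate
sweep <- sapply(c(0.3, 0.5, 0.7, 0.9), function(th) {
  pred <- as.integer(scores >= th)
  c(threshold = th,
    tpr = sum(pred & labels) / sum(labels),   # tarps found
    fnr = sum(!pred & labels) / sum(labels))  # tarps missed
})
t(sweep)  # at 0.3 every tarp is found; at 0.9 most are missed
```

A sweep like this over the training data, scored on whichever metric matters most (here, keeping false negatives low), is one way to justify the chosen thresholds.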
Most of these models could be genuinely useful for locating blue tarps and, therefore, for helping to save human lives. Using the pixel values from image data together with these models and methods, locating blue tarps appears entirely feasible given these results. Additionally, the AUROC values and accuracies obtained with any of these methods show that the results are highly effective and that we can be confident in these models, especially when using a KNN algorithm. The ability to classify confidently and quickly between a blue tarp and something else is critical in life-and-death situations like this, so it also helps that KNN can run faster than the Random Forest or SVM algorithms. Choosing a well-suited model that delivers well-performing metrics at a fairly quick run time makes me confident that we can use these models, tuning parameters, and thresholds to locate blue tarps from pixel data.